This publication was designed to serve as a walk through my thought process and methods when exploring and cleaning data on NYPD Motor Vehicle Collisions Reports in 2018.

The data was retrieved from NYC Open Data - NYPD Motor Vehicle Collisions.

Goal

Using NYPD Motor Vehicle Collision Reports from the year 2018, this document analyzes:

Data Preliminary Cleaning and Filtering

Since both questions are interested in motor vehicle collisions from 2018, let’s do some preliminary data cleaning and filter for 2018 data.

  1. Load Libraries
library(dplyr)
library(tidyr)
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotly)
  1. Import Data
df <- read.csv("NYPD_Motor_Vehicle_Collisions.csv")
  1. Filter for 2018 Data
df$DATE <- as.Date(df$DATE, format="%m/%d/%Y") # convert DATE variable from factor to date format
str(df$DATE) # confirm date format
df18 <- filter(df, str_detect(DATE, '2018')) # filter for 2018
saveRDS(df18, file = "NYPDCollisionReports_2018") # save filtered dataframe as RDS for easier load
df18 <- readRDS("NYPDCollisionReports_2018") # reload 2018 filtered dataframe 
  1. Explore Data Structure
str(df18)
## 'data.frame':    231491 obs. of  29 variables:
##  $ DATE                         : Date, format: "2018-12-31" "2018-12-31" ...
##  $ TIME                         : Factor w/ 1440 levels "0:00","0:01",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BOROUGH                      : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 2 3 4 5 6 1 1 4 1 1 ...
##  $ ZIP.CODE                     : int  10472 11249 10026 11435 10305 NA NA 10027 NA NA ...
##  $ LATITUDE                     : num  40.8 40.7 40.8 40.7 40.6 ...
##  $ LONGITUDE                    : num  -73.9 -74 -73.9 -73.8 -74.1 ...
##  $ LOCATION                     : Factor w/ 199895 levels "","(0.0, 0.0)",..: 171933 89997 161824 98312 12234 62101 1 163969 159883 77618 ...
##  $ ON.STREET.NAME               : Factor w/ 11298 levels "","                                ",..: 1 1 1 1 9325 2740 7570 1275 1665 6645 ...
##  $ CROSS.STREET.NAME            : Factor w/ 17733 levels "","                                ",..: 1 1 1 1 5235 1 1 16515 1 282 ...
##  $ OFF.STREET.NAME              : Factor w/ 129617 levels "","                                        ",..: 7421 10434 80198 18369 1 1 1 1 1 1 ...
##  $ NUMBER.OF.PERSONS.INJURED    : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ NUMBER.OF.PERSONS.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.PEDESTRIANS.INJURED: int  0 0 0 0 0 0 0 1 0 1 ...
##  $ NUMBER.OF.PEDESTRIANS.KILLED : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.INJURED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.CYCLIST.KILLED     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NUMBER.OF.MOTORIST.INJURED   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ NUMBER.OF.MOTORIST.KILLED    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 62 levels "","1","80","Accelerator Defective",..: 58 13 13 58 13 58 58 20 13 20 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 62 levels "","1","80","Accelerator Defective",..: 58 58 58 1 13 58 1 1 13 1 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 51 levels "","1","80","Accelerator Defective",..: 1 1 1 1 1 1 1 1 50 1 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 40 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 28 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ UNIQUE.KEY                   : int  4055273 4055449 4055005 4055499 4055512 4060683 4062834 4055476 4055542 4055311 ...
##  $ VEHICLE.TYPE.CODE.1          : Factor w/ 639 levels "","(ceme","1",..: 471 506 608 506 506 506 406 471 471 506 ...
##  $ VEHICLE.TYPE.CODE.2          : Factor w/ 608 levels "","00","013",..: 473 473 450 1 450 450 1 1 491 1 ...
##  $ VEHICLE.TYPE.CODE.3          : Factor w/ 130 levels "","2 dr sedan",..: 1 1 1 1 1 1 1 1 98 1 ...
##  $ VEHICLE.TYPE.CODE.4          : Factor w/ 70 levels "","2 dr sedan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VEHICLE.TYPE.CODE.5          : Factor w/ 45 levels "","2 dr sedan",..: 1 1 1 1 1 1 1 1 1 1 ...

There were 231491 total reports of collisions in 2018.

Let’s check for missing values.

summary(df18)
##       DATE                 TIME                 BOROUGH     
##  Min.   :2018-01-01   16:00  :  3521                :82313  
##  1st Qu.:2018-04-09   0:00   :  3491   BRONX        :23047  
##  Median :2018-07-05   17:00  :  3422   BROOKLYN     :47293  
##  Mean   :2018-07-04   15:00  :  3299   MANHATTAN    :31400  
##  3rd Qu.:2018-10-03   14:00  :  3171   QUEENS       :41268  
##  Max.   :2018-12-31   18:00  :  3103   STATEN ISLAND: 6170  
##                       (Other):211484                        
##     ZIP.CODE        LATITUDE       LONGITUDE      
##  Min.   :10000   Min.   : 0.00   Min.   :-201.24  
##  1st Qu.:10314   1st Qu.:40.67   1st Qu.: -73.97  
##  Median :11208   Median :40.72   Median : -73.92  
##  Mean   :10856   Mean   :40.67   Mean   : -73.84  
##  3rd Qu.:11249   3rd Qu.:40.77   3rd Qu.: -73.86  
##  Max.   :11697   Max.   :41.12   Max.   :   0.00  
##  NA's   :82352   NA's   :15117   NA's   :15117    
##                     LOCATION     
##                         : 15117  
##  (0.0, 0.0)             :   294  
##  (40.608757, -74.038086):   281  
##  (40.798256, -73.82744) :   257  
##  (40.604153, -74.05198) :   201  
##  (40.861862, -73.91282) :   171  
##  (Other)                :215170  
##                           ON.STREET.NAME   CROSS.STREET.NAME
##                                  : 55649           :116563  
##  BELT PARKWAY                    :  3176   3 AVENUE:  1272  
##  LONG ISLAND EXPRESSWAY          :  2518   BROADWAY:  1032  
##  BROOKLYN QUEENS EXPRESSWAY      :  2451   2 AVENUE:   833  
##  GRAND CENTRAL PKWY              :  2005   5 AVENUE:   723  
##  BROADWAY                        :  1947   7 AVENUE:   721  
##  (Other)                         :163745   (Other) :110347  
##                                  OFF.STREET.NAME  
##                                          :176555  
##  772       EDGEWATER ROAD                :    92  
##  110-00    ROCKAWAY BOULEVARD            :    87  
##  2800      VICTORY BOULEVARD             :    83  
##  2655      RICHMOND AVENUE               :    53  
##  501       GATEWAY DRIVE                 :    41  
##  (Other)                                 : 54580  
##  NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
##  Min.   : 0.0000           Min.   :0.000000        
##  1st Qu.: 0.0000           1st Qu.:0.000000        
##  Median : 0.0000           Median :0.000000        
##  Mean   : 0.2675           Mean   :0.000981        
##  3rd Qu.: 0.0000           3rd Qu.:0.000000        
##  Max.   :22.0000           Max.   :2.000000        
##  NA's   :5                 NA's   :14              
##  NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
##  Min.   :0.00000               Min.   :0.0000000           
##  1st Qu.:0.00000               1st Qu.:0.0000000           
##  Median :0.00000               Median :0.0000000           
##  Mean   :0.04802               Mean   :0.0005184           
##  3rd Qu.:0.00000               3rd Qu.:0.0000000           
##  Max.   :6.00000               Max.   :2.0000000           
##                                                            
##  NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
##  Min.   :0.00000           Min.   :0.00e+00        
##  1st Qu.:0.00000           1st Qu.:0.00e+00        
##  Median :0.00000           Median :0.00e+00        
##  Mean   :0.02041           Mean   :4.32e-05        
##  3rd Qu.:0.00000           3rd Qu.:0.00e+00        
##  Max.   :2.00000           Max.   :1.00e+00        
##                                                    
##  NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
##  Min.   : 0.000             Min.   :0.000000         
##  1st Qu.: 0.000             1st Qu.:0.000000         
##  Median : 0.000             Median :0.000000         
##  Mean   : 0.199             Mean   :0.000419         
##  3rd Qu.: 0.000             3rd Qu.:0.000000         
##  Max.   :21.000             Max.   :2.000000         
##                                                      
##                 CONTRIBUTING.FACTOR.VEHICLE.1
##  Driver Inattention/Distraction:57053        
##  Unspecified                   :49972        
##  Following Too Closely         :22272        
##  Failure to Yield Right-of-Way :16355        
##  Backing Unsafely              :11406        
##  Passing or Lane Usage Improper:10671        
##  (Other)                       :63762        
##                 CONTRIBUTING.FACTOR.VEHICLE.2
##  Unspecified                   :161986       
##                                : 34833       
##  Driver Inattention/Distraction: 13157       
##  Following Too Closely         :  3745       
##  Other Vehicular               :  2909       
##  Passing or Lane Usage Improper:  2691       
##  (Other)                       : 12170       
##                 CONTRIBUTING.FACTOR.VEHICLE.3
##                                :216790       
##  Unspecified                   : 13699       
##  Following Too Closely         :   336       
##  Driver Inattention/Distraction:   230       
##  Other Vehicular               :   219       
##  Reaction to Uninvolved Vehicle:    39       
##  (Other)                       :   178       
##                 CONTRIBUTING.FACTOR.VEHICLE.4
##                                :228445       
##  Unspecified                   :  2887       
##  Following Too Closely         :    63       
##  Other Vehicular               :    43       
##  Driver Inattention/Distraction:    22       
##  Reaction to Uninvolved Vehicle:     8       
##  (Other)                       :    23       
##                 CONTRIBUTING.FACTOR.VEHICLE.5   UNIQUE.KEY     
##                                :230679        Min.   :3511951  
##  Unspecified                   :   773        1st Qu.:3879320  
##  Following Too Closely         :    15        Median :3937300  
##  Other Vehicular               :    11        Mean   :3937781  
##  Driver Inattention/Distraction:     4        3rd Qu.:3995270  
##  Alcohol Involvement           :     2        Max.   :4152249  
##  (Other)                       :     7                         
##                           VEHICLE.TYPE.CODE.1
##  Sedan                              :71568   
##  Station Wagon/Sport Utility Vehicle:55293   
##  PASSENGER VEHICLE                  :35478   
##  SPORT UTILITY / STATION WAGON      :27455   
##  Taxi                               : 7382   
##  Pick-up Truck                      : 4801   
##  (Other)                            :29514   
##                           VEHICLE.TYPE.CODE.2
##  Sedan                              :54358   
##                                     :49020   
##  Station Wagon/Sport Utility Vehicle:43386   
##  PASSENGER VEHICLE                  :25986   
##  SPORT UTILITY / STATION WAGON      :20783   
##  (Other)                            :37956   
##  NA's                               :    2   
##                           VEHICLE.TYPE.CODE.3
##                                     :217636  
##  Sedan                              :  4659  
##  Station Wagon/Sport Utility Vehicle:  3828  
##  PASSENGER VEHICLE                  :  2056  
##  SPORT UTILITY / STATION WAGON      :  1791  
##  Taxi                               :   316  
##  (Other)                            :  1205  
##                           VEHICLE.TYPE.CODE.4
##                                     :228619  
##  Sedan                              :  1053  
##  Station Wagon/Sport Utility Vehicle:   782  
##  PASSENGER VEHICLE                  :   432  
##  SPORT UTILITY / STATION WAGON      :   371  
##  Taxi                               :    50  
##  (Other)                            :   184  
##                           VEHICLE.TYPE.CODE.5
##                                     :230729  
##  Sedan                              :   263  
##  Station Wagon/Sport Utility Vehicle:   206  
##  PASSENGER VEHICLE                  :   117  
##  SPORT UTILITY / STATION WAGON      :   104  
##  Pick-up Truck                      :    20  
##  (Other)                            :    52

There are NAs that exist. We’ll continue with our analysis and make sense of them as needed by dive deeper into our analysis.

Question 1

In 2018, what were the best and worst times to be driving if collision safety was your only consideration?

Method

  1. Extract Important Variables
  2. Explore Filtered Data
  3. Visualize relationship of collision safety and time
  4. Conclude best/worst times to be driving in regards to collision safety
  5. Discuss limitations and concerns

1. Extract Important Variables

A row indicates everytime a motor vehicle collision occurred. If we only want to examine the relationship between time and vehicle collision, let’s filter for the date, time, and unique.key from the 2018 data. The unique.key acts as an identifier for individually reported collisions, and we can analyze the relationship between the unique.key (or the occurrence of collisions) based on date and time.

df18_Time <- select(df18, DATE, TIME, UNIQUE.KEY)

Let’s go ahead and combine date and time into one variable using the R package lubridate.

datetime <- as.POSIXct(paste(df18_Time$DATE, df18_Time$TIME), format="%Y-%m-%d %H:%M")
df18_Time$DATETIME <- datetime

Let’s convert the stand alone TIME variable from factor to hours and minute using the R package lubridate.

df18_Time$TIME <- hm(df18_Time$TIME)

2. Explore Filtered Data

str(df18_Time)
## 'data.frame':    231491 obs. of  4 variables:
##  $ DATE      : Date, format: "2018-12-31" "2018-12-31" ...
##  $ TIME      :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ minute: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ UNIQUE.KEY: int  4055273 4055449 4055005 4055499 4055512 4060683 4062834 4055476 4055542 4055311 ...
##  $ DATETIME  : POSIXct, format: "2018-12-31 00:00:00" "2018-12-31 00:00:00" ...
summary(df18_Time)
##       DATE                 TIME                          
##  Min.   :2018-01-01   Min.   :0S                         
##  1st Qu.:2018-04-09   1st Qu.:9H 44M 0S                  
##  Median :2018-07-05   Median :14H 17M 0S                 
##  Mean   :2018-07-04   Mean   :13H 36M 3.88991364675167S  
##  3rd Qu.:2018-10-03   3rd Qu.:17H 50M 0S                 
##  Max.   :2018-12-31   Max.   :23H 59M 0S                 
##    UNIQUE.KEY         DATETIME                  
##  Min.   :3511951   Min.   :2018-01-01 00:00:00  
##  1st Qu.:3879320   1st Qu.:2018-04-09 16:04:00  
##  Median :3937300   Median :2018-07-05 09:04:00  
##  Mean   :3937781   Mean   :2018-07-05 11:19:20  
##  3rd Qu.:3995270   3rd Qu.:2018-10-03 19:04:30  
##  Max.   :4152249   Max.   :2018-12-31 23:59:00
head(df18_Time)
##         DATE TIME UNIQUE.KEY   DATETIME
## 1 2018-12-31   0S    4055273 2018-12-31
## 2 2018-12-31   0S    4055449 2018-12-31
## 3 2018-12-31   0S    4055005 2018-12-31
## 4 2018-12-31   0S    4055499 2018-12-31
## 5 2018-12-31   0S    4055512 2018-12-31
## 6 2018-12-31   0S    4060683 2018-12-31

The number of 2018 observations stay the same but our columns have decreased to represent 3 variables (i.e. date, time, and unique.key).

The date ranges from 2018-01-01 to 2018-12-31. The hour/minute ranges from 0 seconds to 23 hours and 59 Minutes.

The date/hour/minute ranges from 2018-01-01 00:00:00 to 2018-12-31 23:59:00.

df18_Time$UNIQUE.KEY %>% as.factor() %>% unique() %>% as.data.frame() %>% nrow()
## [1] 231491

There are a total of 231491 unique keys.

sum(is.na(df18_Time$DATE)) 
## [1] 0
sum(is.na(df18_Time$TIME)) 
## [1] 0
sum(is.na(df18_Time$UNIQUE.KEY)) 
## [1] 0
sum(is.na(df18_Time$DATETIME)) 
## [1] 0

There are no NAs.

3. Visualize relationship of collision safety and time

By Month

topMonth <- df18_Time %>% group_by(month(df18_Time$DATETIME)) %>% count() %>% as.data.frame()
topMonth
##    month(df18_Time$DATETIME)     n
## 1                          1 18123
## 2                          2 15989
## 3                          3 19273
## 4                          4 18195
## 5                          5 20842
## 6                          6 20790
## 7                          7 19692
## 8                          8 19632
## 9                          9 19226
## 10                        10 20810
## 11                        11 19426
## 12                        12 19493
mean(topMonth$n)
## [1] 19290.92
df18_Time %>% 
  mutate(month = month(df18_Time$DATETIME)) %>% 
  ggplot(aes(x = month)) +
    geom_bar() + scale_x_continuous("Month", breaks=c(1,2,3,4,5,6,7,8,9,10,11,12)) + 
  geom_hline(yintercept = mean(topMonth$n)) + 
  annotate(geom="text", x=2, y=20000, label="mean = 19290.92",
              color="blue") +
  ggtitle("Reported Motor Vehicle Collisions by Month")

On average, there were 19290.92 motor vehicle collisions reported per month in the year of 2018. May (20842), October (20810), and June (20790) were the top 3 months where motor vehicle collisions were most reported. February (15989) has the least reports, but note that February is also the shortest calendar month, so there are less days to report collisions than other months.

By Weekday

topWeekday <- df18_Time %>% group_by(wday(df18_Time$DATETIME)) %>% count() %>% as.data.frame()
mean(topWeekday$n)
## [1] 33070.14
df18_Time %>% 
  mutate(wday = wday(df18_Time$DATETIME, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
  geom_bar() + ggtitle("Reported Motor Vehicle Collisions by Weekday") + 
  annotate(geom="text", x=5.5, y=32000, label="mean = 33070.14",
              color="blue") +   
  geom_hline(yintercept = mean(topWeekday$n)) + 
  geom_text(stat='count', aes(label=..count..), vjust=0)

On an average weekday, there are 33070.14 reported vehicle collisions. Fridays had the top reported motor vehicle collisions at 36919. Sundays had the least reported motor vehicle collisions at 26577.

By Hour

topHour <- df18_Time %>% group_by(hour(df18_Time$DATETIME)) %>% count() %>% as.data.frame()
colnames(topHour) <- c("hour","n")
p <- ggplot(data=topHour,aes(x=hour, y=n)) +
    geom_bar(stat="identity") + ggtitle("Reported Motor Vehicle Collisions by Hour")
ggplotly(p)

Since there are 24 hours, it would be messy to have the counts on the graph. As an alternative, we will use an interactive graph so that we can hover over the bar to see the hour and count of reported vehicle collisions for that hour. From the graph, we see that the distribution of reported motor vehicle collisions tends to happen between the hours of 14-17. We know that there tends to be rush hour between that time, so the increase in traffic of people commuting from work during that time may result in more collisions. Similarly, there is a peak between the hours of 8 and 9, which is also rush hour of when people are commuting to work.

By Minute

topMinute <- df18_Time %>% group_by(minute(df18_Time$DATETIME)) %>% count() %>% as.data.frame()
colnames(topMinute) <- c("minute","n")
p2 <- ggplot(data=topMinute,aes(x=minute, y=n)) +
    geom_bar(stat="identity") + ggtitle("Reported Motor Vehicle Collisions by Minute")
ggplotly(p2)

Since we have the data for minute, let’s take a look at which minute of the hour have the most reported motor vehicle collisions. From our plot, minutes 0 and 30 have the highest reports.

During moments of collision, perhaps it’s not always easy to capture the exact minute so capturing the time by the half hour seems reasonable.

Furthermore, we see there is a peak in reports every 5th minute (e.g. 5, 10, 15, 20…55). It’s probably easier to confidently report a division of the hour by a multiple of 5 than it is a multiple of 2 for instance.

Minutes seems too granular and inaccurate so we won’t include this analysis as a data highlight in the final report.

4. Conclude best/worst times to be driving in regards to collision safety

Based on reports of motor vehicle collision in the 2018 calendar year, we see the most often reports are happening during the month of May, on Friday, and during rush hour (8-9 in the morning) or (16-17 in the late afternoon) on the hour.

5. Discuss limitations and concerns

We looked at the top reports by month, weekday, hour, and minute. However, there was on average 19290.92 reports a month, or from a different perspective using the same data, there was on average 33070.14 reports on any given weekday. Furthermore, there was a left skewed distribution where most reports are between the hours of 8-0 when people are generally not sleeping compared to the hours of 0-8.

There were a total of 231481 official motor vehicle collision reports in the year of 2018. We’re not sure if this total amount is unusually low or high, so it would be interesting to compare total reports and patterns across previous years. That would help us understand how our sample size of reports from 2018 compare to the true mean and distribution of all historically reported motor vehicle collisions.

Question 2

In 2018, what effect did vehicle type have on injury and mortality rate given an accident?

Method

  1. Explore Data
  2. Extract Important Variables
  3. Visualize relationship between vehicle type and injury
  4. Visualize relationship between vehicle type and mortality
  5. Conclude relationship between vehicle type, injury, and mortality rate
  6. Discuss limitations and concerns

1. Explore Data

colnames(df18)
##  [1] "DATE"                          "TIME"                         
##  [3] "BOROUGH"                       "ZIP.CODE"                     
##  [5] "LATITUDE"                      "LONGITUDE"                    
##  [7] "LOCATION"                      "ON.STREET.NAME"               
##  [9] "CROSS.STREET.NAME"             "OFF.STREET.NAME"              
## [11] "NUMBER.OF.PERSONS.INJURED"     "NUMBER.OF.PERSONS.KILLED"     
## [13] "NUMBER.OF.PEDESTRIANS.INJURED" "NUMBER.OF.PEDESTRIANS.KILLED" 
## [15] "NUMBER.OF.CYCLIST.INJURED"     "NUMBER.OF.CYCLIST.KILLED"     
## [17] "NUMBER.OF.MOTORIST.INJURED"    "NUMBER.OF.MOTORIST.KILLED"    
## [19] "CONTRIBUTING.FACTOR.VEHICLE.1" "CONTRIBUTING.FACTOR.VEHICLE.2"
## [21] "CONTRIBUTING.FACTOR.VEHICLE.3" "CONTRIBUTING.FACTOR.VEHICLE.4"
## [23] "CONTRIBUTING.FACTOR.VEHICLE.5" "UNIQUE.KEY"                   
## [25] "VEHICLE.TYPE.CODE.1"           "VEHICLE.TYPE.CODE.2"          
## [27] "VEHICLE.TYPE.CODE.3"           "VEHICLE.TYPE.CODE.4"          
## [29] "VEHICLE.TYPE.CODE.5"
  1. Extract Important Variables

We are interested in looking at the effect of vehicle type to injury and mortality rates. So let’s pull those variables out.

vtype <- df18[,11:29]
vtype <- vtype[,-c(9:13)] %>% select(-UNIQUE.KEY)
head(vtype)
##   NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1                             0                            0
## 2                             0                            0
## 3                             0                            0
## 4                             0                            0
## 5                             0                            0
## 6                             0                            0
##   NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
## 1                          0                         0
## 2                          0                         0
## 3                          0                         0
## 4                          0                         0
## 5                          0                         0
## 6                          0                         0
##                   VEHICLE.TYPE.CODE.1                 VEHICLE.TYPE.CODE.2
## 1                               Sedan Station Wagon/Sport Utility Vehicle
## 2 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle
## 3                                 Van                               Sedan
## 4 Station Wagon/Sport Utility Vehicle                                    
## 5 Station Wagon/Sport Utility Vehicle                               Sedan
## 6 Station Wagon/Sport Utility Vehicle                               Sedan
##   VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5
## 1                                                            
## 2                                                            
## 3                                                            
## 4                                                            
## 5                                                            
## 6

We ultimately want a table that describes the vehicle type, the count of injuries in total and by injury type, and the count of mortality rates in total and by injury type. Our current table has up to 5 vehicles that are reported as part of the collision.

We’ll first sum the count of total injuries and mortality rates for each report.

vtype$total_injury <- vtype$NUMBER.OF.PERSONS.INJURED + vtype$NUMBER.OF.PEDESTRIANS.INJURED + vtype$NUMBER.OF.CYCLIST.INJURED + vtype$NUMBER.OF.MOTORIST.INJURED

vtype$total_mortality <- vtype$NUMBER.OF.PERSONS.KILLED + vtype$NUMBER.OF.PEDESTRIANS.KILLED + vtype$NUMBER.OF.CYCLIST.KILLED + vtype$NUMBER.OF.MOTORIST.KILLED

vtype <- vtype[,-(1:8)]
head(vtype)
##                   VEHICLE.TYPE.CODE.1                 VEHICLE.TYPE.CODE.2
## 1                               Sedan Station Wagon/Sport Utility Vehicle
## 2 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle
## 3                                 Van                               Sedan
## 4 Station Wagon/Sport Utility Vehicle                                    
## 5 Station Wagon/Sport Utility Vehicle                               Sedan
## 6 Station Wagon/Sport Utility Vehicle                               Sedan
##   VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5 total_injury
## 1                                                                        0
## 2                                                                        0
## 3                                                                        0
## 4                                                                        0
## 5                                                                        0
## 6                                                                        0
##   total_mortality
## 1               0
## 2               0
## 3               0
## 4               0
## 5               0
## 6               0

Next we’ll gather all vehicle types into one column. We’ll filter out NAs and any empty cells. After, we’ll remove the vehicle report number (i.e. vehicle 1, vehicle 2… vehicle 5 reported in the collision) for our final table.

vtype <- gather(vtype, "vehicle_number", "vehicle_type", 1:5, na.rm=TRUE) %>% filter(vehicle_type>0) %>% select(-vehicle_number)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(vtype)
##   total_injury total_mortality                        vehicle_type
## 1            0               0                               Sedan
## 2            0               0 Station Wagon/Sport Utility Vehicle
## 3            0               0                                 Van
## 4            0               0 Station Wagon/Sport Utility Vehicle
## 5            0               0 Station Wagon/Sport Utility Vehicle
## 6            0               0 Station Wagon/Sport Utility Vehicle

Right now, our table has repeated vehicle_types (i.e. sedan appears more than once) because each row represents a report where that vehicle_type was involved, in addition to the total_injury and total_mortality counts that occurred in that report. Let’s go ahead and make a final table that groups by vehicle_type and sums total_injury and total_mortality. This will ultimately give us a table of unique vehicle_types, that is not repeated vehicle types, with a count of injuries in total and by injury type, and a count of mortality in total and by mortality type across reports in 2018.

vtype_final <- vtype %>% group_by(vehicle_type) %>% summarise_all(funs(sum))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
##   # Before:
##   funs(name = f(.))
## 
##   # After: 
##   list(name = ~ f(.))
## This warning is displayed once per session.
vtype_final
## # A tibble: 637 x 3
##    vehicle_type total_injury total_mortality
##    <chr>               <int>           <int>
##  1 00                      0               0
##  2 013                     0               0
##  3 1                       2               0
##  4 12 Pa                   0               0
##  5 15 Pa                   0               0
##  6 18 Wh                   0               0
##  7 18 WH                   0               0
##  8 2 DOO                   0               0
##  9 2 TON                   0               0
## 10 2- to                   0               0
## # … with 627 more rows

Immediately we can see that there are 637 vehicle_type that have been reported but those vehicle_type values look messy. For example Amb, AMB, ambu, AMBU, ambul, Ambul, AMbul, AMBUL, Ambulance, and AMBULANCE all seem to reference Ambulance. This part of the cleaning would require more context as to how the vehicle_type values were inputted into the data file. Since we don’t want to assume what vehicle_types belong to each other, we can hold off on that. However, if we did have that information, we would be able to find and match strings, and recalculate the total_injury, for example, all vehicle_types that are related to AMBULANCE we could recalculate total sums for.

Also note that there are 0 values. This means that though that vehicle_type was involved in a reported motor vehicle collision, there were no reports of injury or mortality.

3. Visualize relationship between vehicle type and injury

Top Vehicle Types With Greatest Reports of Injury

Top 10

ggplot(top_n(vtype_final, 10, total_injury), aes(x=reorder(vehicle_type, total_injury), y=total_injury)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Vehicle Type") + ylab("Total Injury Count") + ggtitle("Top 10 Vehicle Types with Greatest Reports of Injury") + geom_text(aes(label=total_injury),hjust=-.2, vjust=.5) + ylim(0,40000) + coord_flip() + theme(plot.title=element_text( hjust=2, vjust=0.5, face='bold'))

Passenger Vehicle (35042) and Sport Utility/Station Wagons (26295) have the highest reports of injury.

Top 20

ggplot(top_n(vtype_final, 20, total_injury), aes(x=reorder(vehicle_type, total_injury), y=total_injury)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Vehicle Type") + ylab("Total Injury Count") + ggtitle("Top 20 Vehicle Types with Greatest Reports of Injury") + geom_text(aes(label=total_injury),hjust=-.2, vjust=.5) + ylim(0,40000) + coord_flip() + theme(plot.title=element_text( hjust=2, vjust=0.5, face='bold'))

When we expand to top 20, we see the words pick-up truck appear again since the string formatting is different, however they likely represent the same vehicle_type. Notably, we also have Convertable, a Tractor Truck Diesel (another type of Truck), and Ambulance joining the top 20 list.

Vehicle Type Effect on Injury Type

Let’s explore who the top vehicle types (i.e Passenger Vehicle (35042) and Sport Utility/Station Wagons (26295)) are injuring.

To explore this we’ll need our data formatted in a way we can group by injury type. The code for that is below.

vtype <- df18[,11:29]
vtype <- vtype[,-c(9:13)] %>% select(-UNIQUE.KEY)

vtype_group <- vtype %>% gather("type_injury", "injury_count", c("NUMBER.OF.PERSONS.INJURED","NUMBER.OF.PEDESTRIANS.INJURED","NUMBER.OF.CYCLIST.INJURED","NUMBER.OF.MOTORIST.INJURED"))

vtype_group <- vtype_group %>% gather("type_mortality", "mortality_count", c("NUMBER.OF.PERSONS.KILLED","NUMBER.OF.PEDESTRIANS.KILLED","NUMBER.OF.CYCLIST.KILLED","NUMBER.OF.MOTORIST.KILLED"))

vtype_group <- vtype_group %>% gather("vehicle_number", "vehicle_type", 1:5)
head(vtype)
##   NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## 1                             0                            0
## 2                             0                            0
## 3                             0                            0
## 4                             0                            0
## 5                             0                            0
## 6                             0                            0
##   NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## 1                         0                        0
## 2                         0                        0
## 3                         0                        0
## 4                         0                        0
## 5                         0                        0
## 6                         0                        0
##   NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
## 1                          0                         0
## 2                          0                         0
## 3                          0                         0
## 4                          0                         0
## 5                          0                         0
## 6                          0                         0
##                   VEHICLE.TYPE.CODE.1                 VEHICLE.TYPE.CODE.2
## 1                               Sedan Station Wagon/Sport Utility Vehicle
## 2 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle
## 3                                 Van                               Sedan
## 4 Station Wagon/Sport Utility Vehicle                                    
## 5 Station Wagon/Sport Utility Vehicle                               Sedan
## 6 Station Wagon/Sport Utility Vehicle                               Sedan
##   VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5
## 1                                                            
## 2                                                            
## 3                                                            
## 4                                                            
## 5                                                            
## 6

Vehicle Type Effect on Injury Type

x <- select(vtype_group, vehicle_type, type_injury, injury_count) %>% filter(vehicle_type=="PASSENGER VEHICLE" | vehicle_type=="SPORT UTILITY / STATION WAGON" | vehicle_type=="Bike") %>% group_by(vehicle_type, type_injury) %>% summarise(sum(injury_count))
colnames(x) <- c("vehicle_type", "type_injury", "injury_count")

ggplot(x, aes(x=type_injury, y=injury_count, fill=vehicle_type)) + geom_bar(stat="identity") + facet_grid(vehicle_type~.) + ylab("Total Injury Count") + ggtitle("Top 3 Vehicle Types and Most Reported Injury Type") + theme(plot.title=element_text(face='bold')) + guides(fill=FALSE) + scale_x_discrete("Injury Type", labels=c("Cyclist", "Motorist", "Pedestrians", "Persons"))

  • Cyclists and Persons are most hurt on bikes.
  • Motorists and Persons are most hurt on Passenger Vechicles and Sport Utility/Station Wagons.

Data Table

I’ll include the exact numbers on the top 3 vehicle types causing injuries and the injury type (or individual who was injured) in the table below.

DT::datatable(x)

4. Visualize relationship between vehicle type and mortality

Top Vehicle Types With Greatest Reports of Mortality

Top 10

ggplot(top_n(vtype_final, 10, total_mortality), aes(x=reorder(vehicle_type, total_mortality), y=total_mortality)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Vehicle Type") + ylab("Total Mortality Count") + ggtitle("Top 10 Vehicle Types with Greatest Reports of Mortality") + geom_text(aes(label=total_mortality),hjust=-.2, vjust=.5) + ylim(0, 100) + coord_flip() + theme(plot.title=element_text( hjust=1.5, vjust=0.5, face='bold'))

Passenger Vehicle (86), Sport Utility/Station Wagon (76), and Motorcycle (64) have the highest reports of mortality.

Top 20

ggplot(top_n(vtype_final, 20, total_mortality), aes(x=reorder(vehicle_type, total_mortality), y=total_mortality)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Vehicle Type") + ylab("Total Mortality Count") + ggtitle("Top 20 Vehicle Types with Greatest Reports of Mortality") + geom_text(aes(label=total_mortality),hjust=-.2, vjust=.5) + ylim(0, 100) + coord_flip() + theme(plot.title=element_text( hjust=1.5, vjust=0.5, face='bold'))

When we expand to top 20, we see Motorscooter, MOTORCYCLE, Motorbike, Moped, and Minicycle. It seems these represent two wheeled vehicles similar to a Motorcycle which is in the top 3 vehicle types with greatest reports of mortality.

Vehicle Type Effect on Mortality Type

Finally, let’s look at mortality types of the top vehicle types with the highest reports of mortality (i.e Passenger Vehicle (35042) and Sport Utility/Station Wagons (26295)).

x <- select(vtype_group, vehicle_type, type_mortality, mortality_count) %>% filter(vehicle_type=="PASSENGER VEHICLE" | vehicle_type=="SPORT UTILITY / STATION WAGON" | vehicle_type=="Motorcycle") %>% group_by(vehicle_type, type_mortality) %>% summarise(sum(mortality_count))
colnames(x) <- c("vehicle_type", "type_mortality", "mortality_count")

ggplot(x, aes(x=type_mortality, y=mortality_count, fill=vehicle_type)) + geom_bar(stat="identity") + facet_grid(vehicle_type~.) + ylab("Total Mortality Count") + ggtitle("Top 3 Vehicle Types and Most Reported Mortality Type") + theme(plot.title=element_text(face='bold')) + guides(fill=FALSE) + scale_x_discrete("Mortality Type", labels=c("Cyclist", "Motorist", "Pedestrians", "Persons"))

Data Table

The table below includes the exact numbers on the top 3 vehicle types causing deaths and the death type (or individual who was killed) in the table below.

DT::datatable(x)

5. Conclude relationship between vehicle type, injury, and mortality rate

We analyzed the 637 unique vehicle types and the following information was apparent. Passenger Vehicle (35042, 86) and Sport Utility/Station Wagons (26295, 76) had the highest reports of injury and mortality.

Bike (8071) came in third for highest report of injury, while Motorcycle came in third for highest report of mortality (64).

In investigating top 20 highest reports of mortality, we saw a pattern in 2 wheeled vehicle types, for example the types of Motorscooter, MOTORCYCLE, Motorbike, Moped, and Minicycle also joined the top 20 list with Motorcycle. 2 wheeled vehicles appeared more often in top 20 highest reports of mortality than top 20 highest reports of injury. This suggested that while there are high reports of injury on 2 wheeled vehicle types (e.g. motorcycles), there are more often high reports of mortality on 2 wheeled vehicle types. Furthermore, in investigating the impact vehicle types on injuries and deaths, we found that two wheeled vehicles such as motorcycle and bikes result mostly in an injury of the biker or motorcyclist and surrounding persons during the collision.

Overall, we found that persons and motorcyclists were the most injured, while persons, motorcylists, and pedestrians were most killed during the top 3 motor vehicle collisions. Sport Utility/Station Wagon and Passenger Vehicle vehicle types shared similar patterns in injuring mostly motorists and persons and also killing persons, pedestrians, and motorists. Note: Sport Utility/Station Wagon had the only 4 reports of cyclists deaths out of the three.

6. Discuss limitations and concerns

We did not combine any possibly related strings of vehicle_types (e.g. AMmb, Ambul, and Ambulance) during our analysis since we did not want to make any wrong assumptions. However, in the future, this could be done by finding and matching approriate strings, and then we would recalculate the total injury and total mortality rates based on vehicle types.

Also note that there were 0 values. This means that though vehicle_type was involved in a reported motor vehicle collision, there were no reports of injury or mortality. This did not apply to our presented analysis but a future analysis would consider an average rate per vehicle type of injury or mortality, which would require all reports, including those with 0 values. So for each vehicle type the average rate would be calculated by dividing the total number of injuries by the total number of that vehicle type.